An Efficient Approximation Scheme for Data Mining Tasks
Abstract
We investigate biased sampling according to the density of the dataset as a way to speed up general data mining tasks, such as clustering and outlier detection, in large multidimensional datasets. In density-biased sampling, the probability that a given point is included in the sample depends on the local density of the dataset. We propose a general technique for density-biased sampling that can factor in user requirements to sample for properties of interest and can be tuned for specific data mining tasks. This allows great flexibility and improves the accuracy of the results over simple random sampling. We describe our approach in detail, evaluate it analytically, and show how it can be optimized for approximate clustering and outlier detection. Finally, we present a thorough experimental evaluation of the proposed method, applying density-biased sampling to real and synthetic datasets with clustering and outlier detection algorithms, which highlights the utility of our approach.
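The idea of making the inclusion probability depend on local density can be illustrated with a minimal sketch. The abstract does not give the actual sampling formula or density estimator, so everything below is an assumption made for illustration: the function name density_biased_sample, the grid-cell count used as a crude density estimate, and the rule that the inclusion probability is proportional to the density raised to a tunable exponent.

```python
import random
from collections import Counter


def density_biased_sample(points, cell_size, exponent, target_size, seed=0):
    """Sample points with P(include p) proportional to density(p) ** exponent.

    Illustrative assumptions (not taken from the paper):
    - local density is approximated by the number of points that fall
      in the same axis-aligned grid cell of side `cell_size`;
    - probabilities are scaled so the expected sample size is about
      `target_size`, and capped at 1.
    """
    rng = random.Random(seed)

    def cell(p):
        # Map a point to the integer coordinates of its grid cell.
        return tuple(int(c // cell_size) for c in p)

    counts = Counter(cell(p) for p in points)
    weights = [counts[cell(p)] ** exponent for p in points]
    total = sum(weights)

    sample = []
    for p, w in zip(points, weights):
        prob = min(1.0, target_size * w / total)
        if rng.random() < prob:
            sample.append(p)
    return sample
```

Under these assumptions, an exponent of 0 reduces to uniform random sampling, a positive exponent oversamples dense regions (which may suit clustering), and a negative exponent oversamples sparse regions (which may suit outlier detection).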
Similar resources
A Composite Finite Difference Scheme for Subsonic Transonic Flows (RESEARCH NOTE).
This paper presents a simple and computationally-efficient algorithm for solving steady two-dimensional subsonic and transonic compressible flow over an airfoil. This work uses an interactive viscous-inviscid solution by incorporating the viscous effects in a thin shear-layer. Boundary-layer approximation reduces the Navier-Stokes equations to a parabolic set of coupled, non-linear partial diff...
On Approximation Algorithms for Data Mining Applications
We aim to present current trends in the theoretical computer science research on topics which have applications in data mining. We briefly describe data mining tasks in various application contexts. We give an overview of some of the questions and algorithmic issues that are of concern when mining huge amounts of data that do not fit in main memory.
Approximate Privacy-Preserving Data Mining on Vertically Partitioned Data
In today’s ever-increasingly digital world, the concept of data privacy has become more and more important. Researchers have developed many privacy-preserving technologies, particularly in the area of data mining and data sharing. These technologies can compute exact data mining models from private data without revealing private data, but are generally slow. We therefore present a framework for...
An Efficient Representation Model of Distance Distribution Between Two Uncertain Objects
In this paper, we consider the problem of efficient computation of distance distribution between two uncertain objects. It is important to many uncertain query evaluation (e.g., range queries, nearest-neighbour queries) and uncertain data mining (e.g., classification, clustering and outlier detection). However, existing approaches involve distance computations between samples of two objects, wh...
Towards a Task-Based Assessment of Professional Competencies
Performance assessment is increasingly considered a key concept in teacher education programs worldwide. Accordingly, in Iran, a national assessment system was proposed by Farhangian University to assess the professional competencies of its ELT graduates. The concerns regarding the validity and authenticity of traditional measures of teachers' competencies have motivated us to devise a localized...